This article details a new plugin, llm-video-frames, that lets users feed video files into long-context vision LLMs (such as GPT-4.1) by converting them into a sequence of JPEG frames. It shows how to install and use the plugin, walks through examples with the Cleo video, and discusses the cost and technical details of the process. It also covers how the plugin itself was developed with an LLM and highlights other features in LLM 0.25.
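As a minimal sketch of the workflow described, assuming the `video-frames:` fragment syntax from the article (the file name `video.mp4` is a placeholder):

```sh
# Install the plugin into an existing LLM setup
llm install llm-video-frames

# Feed the video in as JPEG frames via the video-frames: fragment prefix;
# fps and timestamps are optional query-string parameters
llm -f 'video-frames:video.mp4?fps=1&timestamps=1' \
  'Describe the key scenes in this video' \
  -m gpt-4.1-mini
```

Higher `fps` values extract more frames (and cost more tokens); `timestamps=1` overlays a timestamp on each frame so the model can reference moments in the video.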
The LLM 0.17 release enables multi-modal input, allowing users to send images, audio, and video files to large language models such as GPT-4o, Llama, and Gemini, complete with a Python API and cost-effective pricing.
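For illustration, attachments on the command line use the `-a/--attachment` option introduced in 0.17 (file names here are placeholders, and the Gemini example assumes the llm-gemini plugin is installed):

```sh
# Attach an image to a prompt with LLM 0.17's -a/--attachment option
llm 'describe this image' -a photo.jpg -m gpt-4o

# Audio and video attachments work the same way with models that accept them
llm 'transcribe this audio' -a interview.mp3 -m gemini-1.5-flash-latest
```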
The author records a screen capture of their Gmail account and uses Google Gemini to extract numeric values from the video.
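The article does this through Google AI Studio rather than the CLI; a comparable run through the llm-gemini plugin might look like the following sketch (the model ID, prompt, and file name are illustrative, not taken from the article):

```sh
# Assumes: llm install llm-gemini, plus an API key via `llm keys set gemini`
llm -m gemini-1.5-flash-8b-latest \
  'Extract each numeric value shown in this video as a JSON list' \
  -a screen-recording.mov
```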